Last week

We learned the big 5 data transformation functions in the dplyr package:

function purpose
filter() Pick observations by their values
arrange() Reorder the rows
select() Pick variables by their names
mutate() Create new variables
summarise() Create summaries

(http://r4ds.had.co.nz/transform.html)

Last week

And we used these functions to create this plot:

Today we’ll focus on tidy data

https://www.rstudio.com/resources/cheatsheets/

Objectives

By the end of this session, you should be able to:

  • Explain what makes data “tidy”, and why messy data suck
  • Tidy data using the tidyr package, a member of the tidyverse
  • Make a participant characteristics table

Login to Duke’s Docker-ized version of RStudio Server

  • Login to your instance by going to https://vm-manage.oit.duke.edu/containers and entering your NetID.
  • Click on Docker
  • Click on RStudio
  • When RStudio loads, restart the R session (Ctrl/Cmd+Shift+F10), clear the console (Ctrl/Cmd+L), and clear your workspace

Open your project

Is your project still open? If not, click on the project icon to load it. (Don’t create a new one.)

Download and open the template

Run the following code in your console. Change products to your preferred subfolder.

  download.file("https://tinyurl.com/y7tmv7uh", 
                destfile = "products/lab-w05.Rmd")

Optional: Change your layout

Tools > Global Options

Tidy Data

Tidy datasets are easy to manipulate, model and visualize, and have a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table (Wickham 2014)

Wickham also uses tidy as a verb meaning “to structure a dataset to facilitate analysis”.

Messy Data 1: Column Headers Are Values

How many variables are in this dataset? This table and the following from Wickham (2014).

Tidy Data 1: Melt

Columns are variables and rows are observations (i.e., combination of religion and income):

Messy Data 2: Multiple variables stored in one column

How would you link up with population data by country and age group to calculate rates? (answer: after tidying)

Tidy Data 2: Melt and Split

Messy Data 3: Variables are stored in both rows and columns

The element column is not a variable with values, but rather a vector of variable names.

Tidy Data 3: Melt and Cast

Every row becomes an observation (i.e., weather station by date) with two measurements:

Messy Data 4: Multiple types in one table

This style of data entry and storage invites errors and inconsistencies:

Tidy Data 4: Use different tables for different levels

Don’t Cry, Tidy!

The Tidy 4

function purpose
gather() Gather variable values spread across multiple columns
spread() Spread out observation values scattered across rows
separate() Split one column into two or more columns
unite() Collapse multiple columns into one column

(http://r4ds.had.co.nz/tidy-data.html)

Some Simple Examples

Load the tidyverse package and look at the included dataset called table4a.

  library(tidyverse)
  table4a
## # A tibble: 3 x 3
##       country `1999` `2000`
## *       <chr>  <int>  <int>
## 1 Afghanistan    745   2666
## 2      Brazil  37737  80488
## 3       China 212258 213766

What is messy about this tibble?

gather()

1999 and 2000 are values, not variables. The variable should be year.

Examples and figures from Wickham and Grolemund (2017)

gather()

  table4a %>%  # remember pipes?
    gather(`1999`, `2000`, key = "year", value = "cases")
## # A tibble: 6 x 3
##       country  year  cases
##         <chr> <chr>  <int>
## 1 Afghanistan  1999    745
## 2      Brazil  1999  37737
## 3       China  1999 212258
## 4 Afghanistan  2000   2666
## 5      Brazil  2000  80488
## 6       China  2000 213766

We gather values by a set of column names that define the key, which we name year. We then store the values in a new variable called cases.

gather()

We get the same result excluding country with a - sign. Note that we’ve also removed the parameter names key and value. This is possible because the default for gather() expects you to specify key before value. See ?gather.

  table4a %>%  # remember pipes?
    gather("year", "cases", -country)
## # A tibble: 6 x 3
##       country  year  cases
##         <chr> <chr>  <int>
## 1 Afghanistan  1999    745
## 2      Brazil  1999  37737
## 3       China  1999 212258
## 4 Afghanistan  2000   2666
## 5      Brazil  2000  80488
## 6       China  2000 213766

gather()

Try gathering table4b (also included with tidyverse) to create variables for year and population values (call it “population”).

  table4b
## # A tibble: 3 x 3
##       country     `1999`     `2000`
## *       <chr>      <int>      <int>
## 1 Afghanistan   19987071   20595360
## 2      Brazil  172006362  174504898
## 3       China 1272915272 1280428583

spread()

Now look at table2. type is not a variable!

  table2
## # A tibble: 12 x 4
##        country  year       type      count
##          <chr> <int>      <chr>      <int>
##  1 Afghanistan  1999      cases        745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000      cases       2666
##  4 Afghanistan  2000 population   20595360
##  5      Brazil  1999      cases      37737
##  6      Brazil  1999 population  172006362
##  7      Brazil  2000      cases      80488
##  8      Brazil  2000 population  174504898
##  9       China  1999      cases     212258
## 10       China  1999 population 1272915272
## 11       China  2000      cases     213766
## 12       China  2000 population 1280428583

spread()

Let’s use spread() to make two proper variables: cases and population.

spread()

The key in this example is the type column, and the values we want to spread are stored in count.

  table2
## # A tibble: 12 x 4
##        country  year       type      count
##          <chr> <int>      <chr>      <int>
##  1 Afghanistan  1999      cases        745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000      cases       2666
##  4 Afghanistan  2000 population   20595360
##  5      Brazil  1999      cases      37737
##  6      Brazil  1999 population  172006362
##  7      Brazil  2000      cases      80488
##  8      Brazil  2000 population  174504898
##  9       China  1999      cases     212258
## 10       China  1999 population 1272915272
## 11       China  2000      cases     213766
## 12       China  2000 population 1280428583

spread()

Just like with gather(), we’ll pass two arguments to spread() in addition to an object: key and value.

  spread(table2, key = type, value = count)
## # A tibble: 6 x 4
##       country  year  cases population
## *       <chr> <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3      Brazil  1999  37737  172006362
## 4      Brazil  2000  80488  174504898
## 5       China  1999 212258 1272915272
## 6       China  2000 213766 1280428583

spread out wide, or gather (stack) into a long (tall) pile

split and combine columns

The separate() and unite() functions split and combine columns.

separate()

Here we want split rate into two columns: cases and population

separate()

It’s simple. Just tell R the column to split and the columns to create:

  table3 %>% 
    separate(col = rate, into = c("cases", "population"))
## # A tibble: 6 x 4
##       country  year  cases population
## *       <chr> <int>  <chr>      <chr>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3      Brazil  1999  37737  172006362
## 4      Brazil  2000  80488  174504898
## 5       China  1999 212258 1272915272
## 6       China  2000 213766 1280428583

separate()

separate() is smart enough to know to split on “/”, but you can also specify with sep = "/". The convert==TRUE argument will guess at the data type rather than leave the new columns characters.

  table3 %>% 
    separate(col = rate, into = c("cases", "population"), 
             sep="/", convert = TRUE)
## # A tibble: 6 x 4
##       country  year  cases population
## *       <chr> <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3      Brazil  1999  37737  172006362
## 4      Brazil  2000  80488  174504898
## 5       China  1999 212258 1272915272
## 6       China  2000 213766 1280428583

separate()

Can also separate on a specific character position with the sep argument. For example, we separate year into century and year by specifying sep==2. Count from left (1, 2, …) or from right (-1, -2, …).

  table3 %>% 
    separate(col = year, into = c("century", "year"), sep=2)
## # A tibble: 6 x 4
##       country century  year              rate
## *       <chr>   <chr> <chr>             <chr>
## 1 Afghanistan      19    99      745/19987071
## 2 Afghanistan      20    00     2666/20595360
## 3      Brazil      19    99   37737/172006362
## 4      Brazil      20    00   80488/174504898
## 5       China      19    99 212258/1272915272
## 6       China      20    00 213766/1280428583

unite()

unite() does the opposite: it combines two or more columns into one.

unite()

In this example, new is the name of the new column we want to create by combining columns century and year.

  table5 %>% 
    unite(col = new, century, year)
## # A tibble: 6 x 3
##       country   new              rate
## *       <chr> <chr>             <chr>
## 1 Afghanistan 19_99      745/19987071
## 2 Afghanistan 20_00     2666/20595360
## 3      Brazil 19_99   37737/172006362
## 4      Brazil 20_00   80488/174504898
## 5       China 19_99 212258/1272915272
## 6       China 20_00 213766/1280428583

unite()

By default unite() will add an underscore, but adding sep="" (no space) will combine without the _.

  table5 %>% 
    unite(col = new, century, year, sep = "")
## # A tibble: 6 x 3
##       country   new              rate
## *       <chr> <chr>             <chr>
## 1 Afghanistan  1999      745/19987071
## 2 Afghanistan  2000     2666/20595360
## 3      Brazil  1999   37737/172006362
## 4      Brazil  2000   80488/174504898
## 5       China  1999 212258/1272915272
## 6       China  2000 213766/1280428583

Missing values

One last point to consider is how to handle missing values when transforming data. Let’s make a tibble called stocks with 2 years of quarterly data on returns.

  stocks <- tibble(
    year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
    qtr    = c(   1,    2,    3,    4,    2,    3,    4),
    return = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)

Missing values

There are two types of missing data:

  1. Explicit: Q4 of 2015 is present in the data, but there is no value.
  2. Implicit: Q1 of 2016 is not even present in the data

Make missing explicit

complete() returns all combinations of a set of columns and will fill in NA when combinations are missing.

  stocks %>% 
    complete(year, qtr)
## # A tibble: 8 x 3
##    year   qtr return
##   <dbl> <dbl>  <dbl>
## 1  2015     1   1.88
## 2  2015     2   0.59
## 3  2015     3   0.35
## 4  2015     4     NA
## 5  2016     1     NA
## 6  2016     2   0.92
## 7  2016     3   0.17
## 8  2016     4   2.66

Our objective

Now we want to use the disclosure data from the first wave to create a participant characteristics table:

variable disclosed non-disclosed
c.age 12.3 11.7
c.female 50.8 51.6
p.age 47.1 49.0
p.female 86.8 89.3

Get the data

If you completed the Week 4 tutorial, you should already have the data file you need for today. If not, run the download chunk in your template.

Once you have the data (a csv file), import it into R and assign to an object called dat.

Import

  dat <- read.csv("input/r2d2-w04.csv", stringsAsFactors = FALSE)

Start with a simple summary

Let’s start by remembering how to summarize our data. Use the dplyr package to get the means of c.age.r1 and p.age.r1. What functions can you use?

Start with a simple summary

You can use summarise() and mean() to get the means of c.age.r1 and p.age.r1.

  summarise(dat, c.age=mean(c.age.r1, na.rm=TRUE),
                 p.age=mean(p.age.r1, na.rm=TRUE))
##      c.age    p.age
## 1 12.12366 47.75269

The same thing with a pipe!

  dat %>%
  # notice we omit the object name dat since piped
    summarise(c.age=mean(c.age.r1, na.rm=TRUE),
              p.age=mean(p.age.r1, na.rm=TRUE))
##      c.age    p.age
## 1 12.12366 47.75269

Create an indicator for knows status

We want to compare participant characteristics by the disclosure status of the child at baseline. The relevant variable is doesKnow16.r1. What can you add before summarise() to create this indicator variable? Use table() to remind yourself how this variable was coded.

  table(dat$doesKnow16.r1)
## 
## child might know or suspect     no, child does not know 
##                          19                          92 
##            yes, child knows 
##                         250

Create an indicator for knows status

  dat %>%
    mutate(p.disclosed = case_when(
              doesKnow16.r1=="yes, child knows" ~ 1, 
              TRUE ~ 0
              )) %>%
  glimpse() # only including here to show the new variable
## Observations: 372
## Variables: 13
## $ c.age.r1      <int> 10, 11, 10, 12, 15, 14, 12, 10, 9, 15, 14, 11, 1...
## $ c.female.r1   <int> 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, ...
## $ p.age.r1      <int> 77, 36, 39, 34, 39, 45, 49, 33, 34, 66, 57, 60, ...
## $ p.female.r1   <int> 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ doesKnow12.r1 <chr> "yes, child knows", "yes, child knows", "yes, ch...
## $ doesKnow15.r1 <int> 6, 7, 6, NA, 13, 6, 9, 5, NA, NA, 11, 5, 10, 11,...
## $ doesKnow16.r1 <chr> "yes, child knows", "yes, child knows", "yes, ch...
## $ doesKnow19.r1 <int> 6, 7, 8, NA, 13, 11, 9, 8, NA, NA, 11, 11, 10, N...
## $ doesKnow20.r1 <chr> "yes, child knows", "no, child does not know", "...
## $ doesKnow23.r1 <int> 7, NA, 9, NA, 13, NA, 9, 8, NA, NA, 11, NA, 10, ...
## $ doesKnow24.r1 <chr> "no, child does not know", NA, "yes, child knows...
## $ doesKnow27.r1 <int> NA, NA, 8, NA, 13, 12, 9, 8, NA, NA, 11, NA, 10,...
## $ p.disclosed   <dbl> 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, ...

Now summarize by disclosure status

Add something after mutate() to prepare to summarize the data.

Now summarize by disclosure status

The group_by() function can be used with summarise() to tell R to perform the summary by disclosure status.

  dat %>%
    mutate(p.disclosed = case_when(
              doesKnow16.r1=="yes, child knows" ~ "disclosed", 
              TRUE ~ "non-disclosed"
              )) %>%
    group_by(p.disclosed) %>%
    summarise(c.age=mean(c.age.r1, na.rm=TRUE),
              p.age=mean(p.age.r1, na.rm=TRUE))
## # A tibble: 2 x 3
##     p.disclosed    c.age  p.age
##           <chr>    <dbl>  <dbl>
## 1     disclosed 12.33200 47.144
## 2 non-disclosed 11.69672 49.000

Add other variables to summarize

Use names() to remind yourself of the variables in dat. You can skip the “doesKnow” variables.

  names(dat)
##  [1] "c.age.r1"      "c.female.r1"   "p.age.r1"      "p.female.r1"  
##  [5] "doesKnow12.r1" "doesKnow15.r1" "doesKnow16.r1" "doesKnow19.r1"
##  [9] "doesKnow20.r1" "doesKnow23.r1" "doesKnow24.r1" "doesKnow27.r1"

Add other variables to summarize

  dat %>%
    mutate(p.disclosed = case_when(
              doesKnow16.r1=="yes, child knows" ~ "disclosed", 
              TRUE ~ "non-disclosed"
              )) %>%
    group_by(p.disclosed) %>%
    summarise(c.age=mean(c.age.r1, na.rm=TRUE),
              p.age=mean(p.age.r1, na.rm=TRUE),
              c.female=mean(c.female.r1, na.rm=TRUE),
              p.female=mean(p.female.r1, na.rm=TRUE))
## # A tibble: 2 x 5
##     p.disclosed    c.age  p.age  c.female  p.female
##           <chr>    <dbl>  <dbl>     <dbl>     <dbl>
## 1     disclosed 12.33200 47.144 0.5080000 0.8680000
## 2 non-disclosed 11.69672 49.000 0.5163934 0.8934426

Check out your tibble

Formatting aside, does this tibble resemble a table of participant characteristics you might see in an article?

## # A tibble: 2 x 5
##     p.disclosed    c.age  p.age  c.female  p.female
##           <chr>    <dbl>  <dbl>     <dbl>     <dbl>
## 1     disclosed 12.33200 47.144 0.5080000 0.8680000
## 2 non-disclosed 11.69672 49.000 0.5163934 0.8934426

Not quite

What functions from tidyr can help us get there?

Use gather()

  dat %>%
    mutate(p.disclosed = case_when(
              doesKnow16.r1=="yes, child knows" ~ "disclosed", 
              TRUE ~ "non-disclosed"
              )) %>%
    group_by(p.disclosed) %>%
    summarise(c.age=mean(c.age.r1, na.rm=TRUE),
              p.age=mean(p.age.r1, na.rm=TRUE),
              c.female=mean(c.female.r1, na.rm=TRUE),
              p.female=mean(p.female.r1, na.rm=TRUE)) %>%
    gather(., key="variable", value="value", -p.disclosed)
## # A tibble: 8 x 3
##     p.disclosed variable      value
##           <chr>    <chr>      <dbl>
## 1     disclosed    c.age 12.3320000
## 2 non-disclosed    c.age 11.6967213
## 3     disclosed    p.age 47.1440000
## 4 non-disclosed    p.age 49.0000000
## 5     disclosed c.female  0.5080000
## 6 non-disclosed c.female  0.5163934
## 7     disclosed p.female  0.8680000
## 8 non-disclosed p.female  0.8934426

Are we there yet?

## # A tibble: 8 x 3
##     p.disclosed variable      value
##           <chr>    <chr>      <dbl>
## 1     disclosed    c.age 12.3320000
## 2 non-disclosed    c.age 11.6967213
## 3     disclosed    p.age 47.1440000
## 4 non-disclosed    p.age 49.0000000
## 5     disclosed c.female  0.5080000
## 6 non-disclosed c.female  0.5163934
## 7     disclosed p.female  0.8680000
## 8 non-disclosed p.female  0.8934426

What do we need to do?

Use spread()

  dat %>%
    mutate(p.disclosed = case_when(
              doesKnow16.r1=="yes, child knows" ~ "disclosed", 
              TRUE ~ "non-disclosed"
              )) %>%
    group_by(p.disclosed) %>%
    summarise(c.age=mean(c.age.r1, na.rm=TRUE),
              p.age=mean(p.age.r1, na.rm=TRUE),
              c.female=mean(c.female.r1, na.rm=TRUE),
              p.female=mean(p.female.r1, na.rm=TRUE)) %>%
    gather(., key="variable", value="value", -p.disclosed) %>%
    spread(., key=p.disclosed, value=value)
## # A tibble: 4 x 3
##   variable disclosed `non-disclosed`
## *    <chr>     <dbl>           <dbl>
## 1    c.age    12.332      11.6967213
## 2 c.female     0.508       0.5163934
## 3    p.age    47.144      49.0000000
## 4 p.female     0.868       0.8934426

Make it a bit nicer

  dat %>%
    mutate(p.disclosed = case_when(
              doesKnow16.r1=="yes, child knows" ~ "disclosed", 
              TRUE ~ "non-disclosed"
              )) %>%
    group_by(p.disclosed) %>%
    summarise(c.age=round(mean(c.age.r1, na.rm=TRUE), 1),
              p.age=round(mean(p.age.r1, na.rm=TRUE), 1),
              c.female=round(mean(c.female.r1, na.rm=TRUE)*100, 1),
              p.female=round(mean(p.female.r1, na.rm=TRUE)*100, 1)) %>%
    gather(., key="variable", value="value", -p.disclosed) %>%
    spread(., key=p.disclosed, value=value)
## # A tibble: 4 x 3
##   variable disclosed `non-disclosed`
## *    <chr>     <dbl>           <dbl>
## 1    c.age      12.3            11.7
## 2 c.female      50.8            51.6
## 3    p.age      47.1            49.0
## 4 p.female      86.8            89.3

References

Wickham, Hadley. 2014. “Tidy Data.” Journal of Statistical Software 59 (10): 1–23. doi:10.18637/jss.v059.i10.

Wickham, Hadley, and Garrett Grolemund. 2017. R for Data Science. O’Reilly. http://r4ds.had.co.nz/.